Bregman divergences (BD)
Definition Let
\(\phi:\mathcal{C}\rightarrow\mathbb{R}\) be
a strictly convex and continuously differentiable function defined on a
measurable convex subset \(\mathcal{C}\subset\mathbb{R}^d\) . Let \(int(\mathcal{C})\) denote its relative
interior. A Bregman divergence indexed by \(\phi\) is a dissimilarity measure \(d_{\phi}:\mathcal{C}\times
int(\mathcal{C})\rightarrow\mathbb{R}\) defined for any pair
\((x,y)\in \mathcal{C}\times
int(\mathcal{C})\) by, \[\begin{equation}
\label{eq:1.10}
d_{\phi}(x,y)=\phi(x)-\phi(y)-\langle x-y,\nabla\phi(y)\rangle
\end{equation}\] where \(\nabla\phi(y)\) denotes the gradient of
\(\phi\) computed at a point \(y\in int(\mathcal{C})\) . A Bregman
divergence is not necessarily a metric as it may not be symmetric and
the triangular inequality might not be satisfied.
This section defines all the Bregman divergences used. The list of
all the Bregman divergences is given in the table below:
Euclidean
\({\|x\|_2^2}=\sum_{i=1}^dx_i^2\)
\(\|x-y\|_2^2\)
\(\mathbb{R}^d\)
General Kullback-Leibler
\(\sum_{i=1}^d x_i\ln(
x_i)\)
\(\sum_{i=1}^d( x_i\ln(\frac{
x_i}{y_i})-(x_i-y_i))\)
\((0,+\infty)^d\)
Logistic
\(\sum_{i=1}^d(x_i\ln(
x_i)+(1- x_i)\ln(1- x_i))\)
\(\sum_{i=1}^d\Big(
x_i\ln(\frac{x_i}{y_i})+(1- x_i)\ln(\frac{1-
x_i}{1-y_i})\Big)\)
\((0,1)^d\)
Itakura-Saito
\(-\sum_{i=1}^d\ln(
x_i)\)
\(\sum_{i=1}^d\Big(\frac{
x_i}{y_i}-\ln(\frac{ x_i}{y_i})-1\Big)\)
\((0,+\infty)^d\)
Exponential
\(\sum_{i=1}^de^{x_i}\)
\(\sum_{i=1}^d(e^{x_i}-e^{y_i}-e^{y_i}(x_i-y_i))\)
\(\mathbb{R}^d\)
Polynomial
\(\sum_{i=1}^d|x|^p,p>2\)
\(\sum_{k=1}^d(|x_k|^p-|y_k|^p-\text{sign}(y_k)^pp(x_k-y_k)y_k^{p-1})\)
\(\mathbb{R}^d\)
Look-up list of Bregman
divergences
The codes below provide a look-up list of all the BDs defined in the
table above.
euclidDiv <- function(X., y., deg = NULL){
res <- sweep(X., 2, y.)
return(rowSums(res^2))
}
gklDiv <- function(X., y., deg = NULL){
res <- c("/", "-") %>%
map(.f = ~ sweep(X., 2, y., FUN = .x))
return(rowSums(X.*log(res[[1]]) - res[[2]]))
}
logDiv <- function(X., y., deg = NULL){
res <- map2(.x = list(X., 1-X.),
.y = list(y., 1-y.),
.f = ~ sweep(.x, 2, .y, FUN = "/"))
return(rowSums(X.*log(res[[1]])+(1-X.)*log(res[[2]])))
}
itaDiv <- function(X., y., deg = NULL){
res <- sweep(X., 2, y., FUN = "/")
return(rowSums(res-log(res) - 1))
}
expDiv <- function(X., y., deg = NULL){
exp_y <- exp(y.)
res <- sweep(1+X., 2, y.) %>%
sweep(2, exp_y, FUN = "*")
return(rowSums(exp(X.)-res))
}
polyDiv <- function(X., y., deg = 3){
S <- map2(.x = list(X., X.^deg),
.y = list(y., y.^deg),
.f = ~ sweep(.x,
MARGIN = 2,
STATS = .y,
FUN = "-"))
if(deg %% 2 == 0){
Tem <- sweep(S[[1]], 2, y.^(deg-1), FUN = "*")
res <- rowSums(S[[2]] - deg * Tem)
}
else{
Tem <- sweep(S[[1]], 2, sign(y.) * y.^(deg-1), FUN = "*")
res <- rowSums(S[[2]] - deg * Tem)
}
return(res)
}
lookup_div <- list(
euclidean = euclidDiv,
gkl = gklDiv,
logistic = logDiv,
itakura = itaDiv,
exponential = expDiv,
polynomial = polyDiv
)
Function :
BregmanDiv
This function computes the matrix of BDs between two sets of data
points. Each set of data points should be stored as matrices, data
frame, or tibble object where each row represents an individual data
point.
BregmanDiv <- function(X.,
C.,
div = c("euclidean",
"gkl",
"logistic",
"itakura",
"exponential",
"polynomial"),
deg = 3){
div <- match.arg(div)
d_c <- dim(C.)
if(is.null(d_c)){
C <- matrix(C., nrow = 1, byrow = TRUE)
} else{
C <- as.matrix(C.)
}
if(is.null(dim(X.))){
X <- matrix(X., nrow = 1, byrow = TRUE)
} else{
X <- as.matrix(X.)
}
dis <- map_dfc(.x = 1:dim(C)[1],
.f = ~ tibble('{{.x}}' := lookup_div[[div]](X, C[.x,], deg = deg)))
return(dis)
}
Step \(K\) :
\(K\) -means with Bregman
divergences
This section implements \(K\) -means
algorithm using Bregman divergences which corresponds to the step \(K\) of KFC procedure.
Function :
findClosestCentroid and newCentroids
These two functions perform the main steps of \(K\) -means algorithm.
findClosestCentroid function assigns any data points to
some cluster according to the nearest centroid among all the centroids,
and newCentroids function compute the new centroids of the
partition given the cluster labels of all data points.
findClosestCentroid <- function(x., centroids., div, deg = 3){
dist <- BregmanDiv(x., centroids., div, deg)
clust <- 1:nrow(x.) %>%
map_int(.f = ~ which.min(dist[.x,]))
return(clust)
}
newCentroids <- function(x., clusters.){
centroids <- unique(clusters.) %>%
map_dfr(.f = ~ colMeans(x.[clusters. == .x, ]))
return(centroids)
}
Function :
kmeansBD
This function performs \(K\) -means
algorithm with BDs.
kmeansBD <- function(train_input,
K,
n_start = 5,
maxIter = 500,
deg = 3,
scale_input = FALSE,
div = "euclidean",
splits = 1,
epsilon = 1e-10,
center_ = NULL,
scale_ = NULL,
id_shuffle = NULL){
start_time <- Sys.time()
# Distortion function
X <- as.matrix(train_input)
N <- dim(X)
if(scale_input){
if(!(is.null(center_) & is.null(scale_))){
if(length(center_) == 1){
center_ <- rep(center_, N[2])
}
if(length(scale_) == 1){
scale_ <- rep(scale_, N[2])
}
} else{
min_ <- apply(X, 2, FUN = min)
c_ <- abs(colMeans(X)/5)
center_ <- min_ - c_
scale_ <- apply(X, 2, FUN = max) - center_ + 1
}
X <- scale(X, center = center_, scale = scale_)
}
if(is.null(id_shuffle)){
train_id <- rep(TRUE, N[1])
if(splits < 1){
train_id[sample(N[1], floor(N[1]*(1-splits)))] <- FALSE
}
} else{
train_id <- id_shuffle
}
X_train1 <- X[train_id,]
X_train2 <- X[!train_id,]
mu <- as.matrix(colMeans(X_train1))
distortion <- function(clus){
cent <- newCentroids(X_train1, clus)
var_within <- 1:K %>%
map(.f = ~ BregmanDiv(X_train1[clus == .x,],
cent[.x,],
div,
deg)) %>%
map(.f = sum) %>%
Reduce("+", .)
return(var_within)
}
# Kmeans algorithm
kmeansWithBD <- function(x., k., maxiter., eps.) {
n. <- nrow(x.)
# initialization
init <- sample(n., k.)
centroids_old <- x.[init,]
i <- 0
while(i < maxIter){
# Assignment step
clusters <- findClosestCentroid(x., centroids_old, div, deg)
# Recompute centroids
centroids_new <- newCentroids(x., clusters)
if ((sum(is.na(centroids_new)) > 0) |
(nrow(centroids_new) != k.)) {
init <- sample(n., k.)
centroids_old <- x.[init,]
warning("NA produced -> reinitialize centroids...!")
}
else{
if(sum(abs(centroids_new - centroids_old)) > eps.){
centroids_old <- centroids_new
} else{
break
}
}
i <- i + 1
}
return(clusters)
}
results <- 1:n_start %>%
map_dfc(.f = ~ tibble("{{.x}}" := kmeansWithBD(X_train1,
K,
maxIter,
epsilon)))
opt_id <- 1:n_start %>%
map_dbl(.f = ~ distortion(results[[.x]])) %>%
which.min
cluster <- clusters <- results[[opt_id]]
j <- 1
ID <- unique(cluster)
for (i in ID) {
clusters[cluster == i] = j
j = j + 1
}
centroids = newCentroids(X_train1, clusters)
time_taken <- Sys.time() - start_time
return(
list(
centroids = centroids,
clusters = clusters,
train_data = list(X_train = X_train1,
X_remain = X_train2,
id_remain = !train_id),
parameters = list(div = div,
deg = deg,
center_ = center_,
scale_ = scale_),
running_time = time_taken
)
)
}
Example.1 : We perform \(K\) -means algorithm with "gkl"
BD on Abalone
dataset.
pacman::p_load(readr)
colname <- c("Type", "LongestShell", "Diameter", "Height", "WholeWeight", "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings")
df <- readr::read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", col_names = colname, delim = ",", show_col_types = FALSE)
n <- nrow(df)
train <- logical(n)
train[sample(n, floor(n*0.8))] <- TRUE
cl <- df[train,2:(ncol(df)-1)] %>%
kmeansBD(K = 3, div = "gkl", splits = 0.5, scale_input = TRUE)
table(cl$clusters)
1 2 3
435 697 539
Step \(F\) :
Fitting predictive models
This section builds global models by fitting local model on each
given cluster of the obtained partition. This corresponds to the step
\(F\) of the procedure.
Function :
fitLocalModels
This function fits local models on all clusters of the obtained
partition.
fitLocalModels <- function(kmeans_BD,
train_response,
model = "lm",
formula = NULL){
start_time <- Sys.time()
X_train <- kmeans_BD$train_data$X_train
y_train <- train_response[!(kmeans_BD$train_data$id_remain)]
X_remain <- kmeans_BD$train_data$X_remain
y_remain <- NULL
if(!is.null(X_remain)){
y_remain <- train_response[kmeans_BD$train_data$id_remain]
}
pacman::p_load(tree)
pacman::p_load(randomForest)
model_ <- ifelse(model == "tree", tree::tree, model)
K <- nrow(kmeans_BD$centroids)
if (is.null(formula)){
form <- formula(target ~ .)
}
else{
form <- update(formula, target ~ .)
}
data_ <- bind_cols(X_train, "target":= y_train)
fit_lookup <- list(lm = "fitted.values",
rf = "predicted")
if(is.character(model_)){
model_lookup <- list(lm = lm,
rf = randomForest::randomForest)
mod <- map(.x = 1:K,
.f = ~ model_lookup[[model_]](formula = form,
data = data_[kmeans_BD$clusters == .x, ]))
} else{
mod <- map(.x = 1:K,
.f = ~ model_(formula = form,
data = data_[kmeans_BD$clusters == .x,]))
}
pred0 <- NULL
if(!is.null(X_remain)){
pred0 <- vector(mode = "numeric",
length = length(y_remain))
clus <- findClosestCentroid(x. = X_remain,
centroids. = kmeans_BD$centroids,
div = kmeans_BD$parameters$div,
deg = kmeans_BD$parameters$deg)
for(i_ in 1:K){
pred0[clus == i_] <- predict(mod[[i_]],
as.data.frame(X_remain[clus == i_,]))
}
}
time_taken <- Sys.time() - start_time
return(list(
local_models = mod,
kmeans_BD = kmeans_BD,
data_remain = list(fit = pred0,
response = y_remain),
running_time = time_taken
))
}
Example.2 : From Example.1 above,
multiple linear regression models are built on all the obtained
clusters. The mean square error of this model, evaluated on the
remaining \(50\%\) of the training data
is computed.
fit <- fitLocalModels(train_response = df$Rings[train],
kmeans_BD = cl,
model = "lm")
mean((fit$data_remain$response- fit$data_remain$fit)^2)
[1] 4.71925
Function :
localPredict
This function allows us to predict any new observation using the
candidate model \(\cal M_j=({\cal
M}_{j,k})_{k=1}^M\) corresponding to Bregman divergence \({\cal B}_j\) , for some \(j\in J\subset\{1,...,M\}\) .
localPredict <- function(localModels,
newData){
kmean_BD <- localModels$kmeans_BD
K <- nrow(kmean_BD$centroids)
newData_ <- newData
if(!(is.null(kmean_BD$parameters$center_))){
newData_ <- scale(newData,
center = kmean_BD$parameters$center_,
scale = kmean_BD$parameters$scale_)
id0 <- newData_ <= 0
if(any(id0)){
min_ <- min(newData_[id0])
newData_[id0] <- runif(sum(id0), min(1e-3, min_/10), min_)
}
}
clus <- findClosestCentroid(x. = newData_,
centroids. = kmean_BD$centroids,
div = kmean_BD$parameters$div,
deg = kmean_BD$parameters$deg)
pred0 <- vector(mode = "numeric", length = nrow(newData_))
for(i_ in 1:K){
pred0[clus == i_] <- predict(localModels$local_models[[i_]],
as.data.frame(newData_[clus == i_,]))
}
pred0 <- as_tibble(pred0)
names(pred0) <- ifelse(kmean_BD$parameters$div == "polynomial",
paste0("polynomial", kmean_BD$parameters$deg),
kmean_BD$parameters$div)
return(pred0)
}
Example.3 : The the performance of the candidate
model corresponding to "gkl" divergence is compared to
random forest regression on a \(20\%\)
testing data.
y_hat <- localPredict(fit,
df[!train, 2:(ncol(df)-1),])
rf <- randomForest(Rings ~ ., data = df[train,2:ncol(df)])
mean((predict(rf, newdata = df[!train,2:ncol(df)])- df$Rings[!train])^2)
[1] 4.290698
mean((y_hat$gkl-df$Rings[!train])^2)
[1] 4.072496
Step \(C\) :
Combining methods
The aggregation methods are available here
.
The aggregation methods are imported into
Rstudio environment.
pacman::p_load(devtools)
### Kernel based consensual aggregation
source_url("https://raw.githubusercontent.com/hassothea/AggregationMethods/main/KernelAggReg.R")
ℹ SHA-1 hash of file is a8312879ba2ed0c5335566dd75fb90751025953c
### MixCobra
source_url("https://raw.githubusercontent.com/hassothea/AggregationMethods/main/MixCobraReg.R")
ℹ SHA-1 hash of file is 63f73453cd5d3e7006bd02ee84e8115300a20178
Function : stepK,
stepF and stepC
These functions allow to set the values of the parameters in the
three steps of the KFC procedure. Each function returns a list of all
the parameters given in its arguments.
stepK = function(K,
n_start = 5,
maxIter = 300,
deg = 3,
scale_input = FALSE,
div = NULL,
splits = 0.75,
epsilon = 1e-10,
center_ = NULL,
scale_ = NULL){
return(list(K = K,
n_start = n_start,
maxIter = maxIter,
deg = deg,
scale_input = scale_input,
div = div,
splits = splits,
epsilon = epsilon,
center_ = center_ ,
scale_ = scale_))
}
stepF = function(formula = NULL,
model = "lm"){
return(list(formula = formula,
model = model))
}
stepC = function(n_cv = 5,
method = c("cobra", "mixcobra"),
opt_methods = c("grad", "grid"),
kernels = "gaussian",
scale_features = FALSE){
return(list(n_cv = n_cv,
method = method,
opt_methods = opt_methods,
kernels = kernels,
scale_features = scale_features))
}
Function : KFCreg
This function is the complete implementation of KFC procedure.
Remark.2 : The parallel argument above
requires internet connection to load the source codes of \(K\) -means algorithm with BDs from GitHub
.
It is performed on the maximum number of clusters of your machine, and
the speed is at least two times faster than without parallelism,
however, it is not so stable depending on your internet connection or
machine. About the aggregation methods of step \(C\) ,
\(^1\) is the kernel-based consensual
aggregation for regression by Has (2021) .
The source codes of these functions are available in KernelAggReg.R ,
and the documentation is available here .
\(^2\) is the aggregation using
input-output trade-off by Fischer
and Mougeot (2019) . The source codes of these functions are
available in MixCobra.R ,
and the documentation is available on here .
KFCreg = function(train_input,
train_response,
test_input,
test_response = NULL,
n_cv = 5,
parallel = TRUE,
inv_sigma = sqrt(.5),
alp = 2,
K_step = stepK(splits = .5),
F_step = stepF(),
C_step = stepC(),
setGradParamAgg = setGradParameter(),
setGridParamAgg = setGridParameter(),
setGradParamMix = setGradParameter_Mix(),
setGridParamMix = setGridParameter_Mix()){
start_time <- Sys.time()
lookup_div_names <- c("euclidean",
"gkl",
"logistic",
"itakura",
"polynomial")
div_ <- K_step$div
### K step: Kmeans clustering with BDs
if (is.null(K_step$div)){
divergences <- lookup_div_names
warning("No divergence provided! All of them are used!")
}
else{
divergences <- K_step$div %>%
map_chr(.f = ~ match.arg(arg = .x,
choices = lookup_div_names))
}
div_list <- divergences %>%
map(.f = (\(x) if(x != "polynomial") return(x) else return(rep("polynomial", length(K_step$deg))))) %>%
unlist
deg_list <- rep(NA, length(div_))
deg_list[div_list == "polynomial"] <- K_step$deg
div_names <- map2_chr(.x = div_list,
.y = deg_list,
.f = (\(x, y) if(is.na(y)) return(x) else return(paste0(x,y))))
### Step K: Kmeans clustering with Bregman divergences
dm <- dim(train_input)
id_shuffle <- vector(length = dm[1])
n_train <- floor(K_step$splits * dm[1])
id_shuffle[sample(dm[1], n_train)] <- TRUE
if(parallel){
numCores <- parallel::detectCores()
doParallel::registerDoParallel(numCores) # use multicore, set to the number of our cores
kmean_ <- foreach(i=1:length(div_names)) %dopar% {
devtools::source_url("https://raw.githubusercontent.com/hassothea/KFC-Procedure/master/kmeanBD.R")
kmeansBD(train_input = train_input,
K = K_step$K,
div = div_list[i],
n_start = K_step$n_start,
maxIter = K_step$maxIter,
deg = deg_list[i],
scale_input = K_step$scale_input,
splits = K_step$splits,
epsilon = K_step$epsilon,
center_ = K_step$center_,
scale_ = K_step$scale_,
id_shuffle = id_shuffle)
}
doParallel::stopImplicitCluster()
} else{
kmean_ <- map2(.x = div_list,
.y = deg_list,
.f = ~ kmeansBD(train_input = train_input,
K = K_step$K,
div = .x,
n_start = K_step$n_start,
maxIter = K_step$maxIter,
deg = .y,
scale_input = K_step$scale_input,
splits = K_step$splits,
epsilon = K_step$epsilon,
center_ = K_step$center_,
scale_ = K_step$scale_,
id_shuffle = id_shuffle))
}
names(kmean_) <- div_names
### F step: Fitting the corresponding model on each observed cluster
model_ <- div_names %>%
map(.f = ~ fitLocalModels(kmean_[[.x]],
train_response = train_response,
model = F_step$model,
formula = F_step$formula))
names(model_) <- div_names
pred_combine <- model_ %>%
map_dfc(.f = ~ .x$data_remain$fit)
y_remain <- train_response[!id_shuffle]
pred_test <- div_names %>%
map_dfc(.f = ~ localPredict(model_[[.x]],
test_input))
names(pred_test) <- names(pred_combine) <- div_names
# C step: Consensual regression aggregation method with kernel-based COBRA
list_method_agg <- list(mixcobra = function(pred){MixCobraReg(train_input = train_input[!id_shuffle,],
train_response = y_remain,
test_input = test_input,
train_predictions = pred,
test_predictions = pred_test,
test_response = test_response,
scale_input = K_step$scale_input,
scale_machine = C_step$scale_features,
n_cv = C_step$n_cv,
inv_sigma = inv_sigma,
alp = alp,
kernels = C_step$kernels,
optimizeMethod = C_step$opt_methods,
setGradParam = setGradParamMix,
setGridParam = setGridParamMix)},
cobra = function(pred){kernelAggReg(train_design = pred,
train_response = y_remain,
test_design = pred_test,
test_response = test_response,
scale_input = K_step$scale_input,
scale_machine = C_step$scale_features,
build_machine = FALSE,
machines = NULL,
n_cv = C_step$n_cv,
inv_sigma = sqrt(.5),
alp = 2,
kernels = C_step$kernels,
optimizeMethod = C_step$opt_methods,
setGradParam = setGradParamAgg,
setGridParam = setGridParamAgg)})
res <- map(.x = C_step$method,
.f = ~ list_method_agg[[.x]](pred_combine))
list_agg_methods <- list(cobra = "cob",
mixcobra = "mix")
names(res) <- C_step$method
ext_fun <- function(L, nam){
tab <- L$fitted_aggregate
names(tab) <- paste0(names(tab), "_", nam)
return(tab)
}
pred_fin <- C_step$method %>%
map_dfc(.f = ~ ext_fun(res[[.x]], list_agg_methods[[.x]]))
time.taken <- Sys.time() - start_time
### To finish
if(is.null(test_response)){
return(list(
predict_final = pred_fin,
predict_local = pred_test,
agg_method = res,
running_time = time.taken
))
} else{
error <- cbind(pred_test, pred_fin) %>%
dplyr::mutate(y_test = test_response) %>%
dplyr::summarise_all(.funs = ~ (. - y_test)) %>%
dplyr::select(-y_test) %>%
dplyr::summarise_all(.funs = ~ mean(.^2))
return(list(
predict_final = pred_fin,
predict_local = pred_test,
agg_method = res,
mse = error,
running_time = time.taken
))
}
}
Example.4 : A complete KFC procedure is implemented
on the same Abalone data, using \(6\)
BDs "euclidean", "itakura",
"gkl", "logistic" and
"polynomial" (of degree \(3\) and \(6\) ). Both aggregation methods are used in
the step \(C\) . Two kernel functions
are used for each aggregation method: "gaussian" (with
gradient descent algorithm) and "epanechnikov" (with grid
search algorithm).
train1 <- logical(n)
train1[sample(n, floor(n*0.8))] <- TRUE
kfc1 <- KFCreg(train_input = df[train1,2:ncol(df)],
train_response = df$Rings[train1],
test_input = df[!train1,2:ncol(df)],
K_step = stepK(K = 3,
scale_input = TRUE,
div = c("eucl", "ita", "gkl", "log" ,"poly"),
deg = c(3, 6),
splits = .5),
C_step = stepC(method = c("cobra", "mixcobra"),
opt_methods = c("grad", "grid"),
kernels = c("gaussian", "gaussian"),
scale_features = FALSE),
setGradParamAgg = setGradParameter(rate = 1),
setGridParamAgg = setGridParameter(min_val = .00001,
max_val = 10,
n_val = 100),
setGradParamMix = setGradParameter_Mix(rate = c(1,1)),
setGridParamMix = setGridParameter_Mix(min_alpha = 1e-10,
max_alpha = 5,
min_beta = 1e-10,
max_beta = 10,
n_alpha = 20,
n_beta = 20))
Kernel-based consensual aggregation method
------------------------------------------
* Gradient descent algorithm ...
Step | Parameter | Gradient | Threshold
---------------------------------------------------
0 | 23.3334 | -2.75e-11 | 1e-10
---------------------------------------------------
1 | 24.3334 | -1.742e-10 | 0.0428571
2 | 30.6667 | 1.467e-10 | 0.2602737
3 | 25.3334 | -1.1e-10 | 0.1739129
4 | 29.3334 | 2.567e-10 | 0.1578946
5 | 20.0000 | -2.75e-11 | 0.3181816
6 | 21.0000 | 2.108e-10 | 0.04999994
7 | 13.3334 | 1.1e-10 | 0.365079
8 | 11.3334 | 0e+00 | 0.1499998
-------------------------------------------------------
Stopped| 11.3334 | 0 | 0
~ Observed parameter: 11.33336 in 8 iterations.
* Grid search algorithm...
~ Observed parameter : 7.47475
MixCobra for regression
-----------------------
* Gradient descent algorithm ...
Step | alpha ; beta | Gradient (alpha ; beta) | Threshold
--------------------------------------------------------------------------------
0 | 7.50002 ; 37.52500 | -2.292e-10 ; -1.1e-10 | 1e-10
--------------------------------------------------------------------------------
1 | 8.5000 ; 38.5250 | -1.467e-10 ; 0e+00 | 0.04441974
2 | 9.1400 ; 38.5250 | -1.375e-10 ; -2.75e-11 | 0.01360977
3 | 9.7400 ; 38.7750 | 2.75e-11 ; -3.667e-11 | 0.01783278
4 | 9.6200 ; 39.1083 | -2.75e-11 ; -1.833e-10 | 0.009344184
5 | 9.7400 ; 40.7750 | -9.167e-11 ; -8.25e-11 | 0.03666585
6 | 10.1400 ; 41.5250 | -1.1e-10 ; 1.1e-10 | 0.0227655
7 | 10.6200 ; 40.5250 | -2.75e-11 ; 2.75e-11 | 0.02864607
8 | 10.7400 ; 40.4000 | 3.667e-11 ; -2.292e-10 | 0.0047903
9 | 10.5800 ; 41.4417 | 0e+00 ; 1.375e-10 | 0.02349758
10 | 10.5800 ; 41.1292 | -1.375e-10 ; -3.759e-10 | 0.00600711
11 | 10.7300 ; 41.5562 | 1.1e-10 ; 2.017e-10 | 0.01116017
12 | 10.6700 ; 41.4417 | -1.1e-10 ; 0e+00 | 0.00333899
13 | 10.7000 ; 41.4417 | -1.192e-10 ; -2.75e-11 | 0.0005756866
14 | 10.7163 ; 41.4456 | -1.1e-10 ; 0e+00 | 0.0003865669
15 | 10.7313 ; 41.4456 | -2.75e-11 ; 3.667e-11 | 0.0002875665
16 | 10.7350 ; 41.4443 | -1.1e-10 ; -1.1e-10 | 9.682615e-05
17 | 10.7500 ; 41.4462 | 1.192e-10 ; 2.108e-10 | 0.0003249014
18 | 10.7338 ; 41.4444 | 0e+00 ; -9.167e-11 | 0.0003471848
19 | 10.7338 ; 41.4448 | -1.558e-10 ; -1.1e-10 | 7.798307e-06
20 | 10.7391 ; 41.4450 | -3.667e-11 ; 5.5e-11 | 0.0001064928
21 | 10.7397 ; 41.4449 | 9.167e-11 ; 2.75e-11 | 1.431605e-05
22 | 10.7382 ; 41.4449 | 0e+00 ; -1.833e-10 | 3.052659e-05
23 | 10.7382 ; 41.4451 | -1.1e-10 ; -3.667e-11 | 3.898789e-06
24 | 10.7386 ; 41.4451 | -1.375e-10 ; 9.167e-11 | 9.372653e-06
25 | 10.7389 ; 41.4450 | -1.192e-10 ; -3.667e-11 | 6.588866e-06
26 | 10.7392 ; 41.4450 | -1.192e-10 ; 3.667e-11 | 5.060538e-06
27 | 10.7394 ; 41.4450 | 1.192e-10 ; 2.75e-11 | 4.963045e-06
28 | 10.7392 ; 41.4450 | -3.667e-11 ; 6.417e-11 | 4.902104e-06
29 | 10.7392 ; 41.4450 | 2.75e-11 ; 7.334e-11 | 8.338345e-07
30 | 10.7392 ; 41.4450 | 2.75e-11 ; 1.192e-10 | 3.781737e-07
31 | 10.7392 ; 41.4450 | -1.1e-10 ; -9.167e-11 | 2.987378e-07
32 | 10.7392 ; 41.4450 | -1.467e-10 ; -1.467e-10 | 6.832472e-07
33 | 10.7392 ; 41.4450 | -1.1e-10 ; 3.667e-11 | 4.717424e-07
34 | 10.7392 ; 41.4450 | -9.167e-11 ; 2.75e-11 | 3.050729e-07
35 | 10.7393 ; 41.4450 | -1.558e-10 ; -2.108e-10 | 2.430593e-07
36 | 10.7393 ; 41.4450 | 3.667e-11 ; 0e+00 | 4.677214e-07
37 | 10.7393 ; 41.4450 | 1.375e-10 ; 1.467e-10 | 9.356861e-08
38 | 10.7393 ; 41.4450 | -1.192e-10 ; -3.667e-11 | 1.876246e-07
39 | 10.7393 ; 41.4450 | 0e+00 ; 0e+00 | 1.53572e-07
40 | 10.7393 ; 41.4450 | 0e+00 ; 0e+00 | 1.558412e-10
--------------------------------------------------------------------------------
Stopped| 10.7393 ; 41.4450 | 0 | 0
~ Observed parameter: (alpha, beta) = ( 10.73927 , 41.44502 ) in 40 itertaions.
* Grid search algorithm...
~ Observed parameter: (alpha, beta) = (1e-10, 7.368421)
The mean square errors evaluated on \(20\%\) -testing data of the above
computation are reported below.
kfc1$predict_final %>%
mutate(rf = predict(rf1, newdata = df[!train1,2:ncol(df)])) %>%
sweep(MARGIN = 1, STATS = df$Rings[!train1], FUN = "-") %>%
.^2 %>%
colMeans
gaussian_grad_cob gaussian_grid_cob gaussian_grad_mix gaussian_grid_mix rf
6.453406e-28 2.144367e-20 1.803888e+00 5.635831e+00 4.603206e+00
📖 Read also KernelAggReg and MixCobraReg .
---
title: "<span style='color: #1C81AA;'>**KFC procedure for regression -</span> [Has et al. (2021)](https://www.tandfonline.com/eprint/YKGS8GTKDBKYFXEGFWSB/full?target=10.1080/00949655.2021.1891539)**"
author: "<span style='color: #D4A51C;'>***Sothea Has***</span>"
date: "5/20/2022"
output:
  html_document:
    css: hideOutput.css
    includes:
      in_header: hideOutput.script
    df_print: paged
    code_folding: hide
    number_sections: yes
    toc: yes
    toc_depth: '2'
    tocdepth: 2
  html_notebook:
    css: hideOutput.css
    includes:
      in_header: hideOutput.script
    code_folding: hide
    number_sections: yes
    toc: yes
    toc_depth: 2
    tocdepth: 2
  pdf_document:
    toc: yes
    toc_depth: '2'
---

<style>
  .btn {
    border-width: 0 0px 0px 0px;
    font-weight: normal;
    text-transform: ;
  }
.btn-default {
  color: #2ecc71;
    background-color: #ffffff;
    border-color: #ffffff;
}
</style>

<!-- Colors
blue : #1FAAE3
yellow : #F0AE14
green : #54D319 
red : #E6180A
-->


```{r, echo=FALSE}
# Check if package "fontawesome" is already installed 

lookup_packages <- installed.packages()[,1]
if(!("fontawesome" %in% lookup_packages))
  install.packages("fontawesome")
```


<span style="color: #1FAAE3;">&#128270;<u> How to download & run the codes?</u></span>{-}
===

All the source codes of the aggregation methods are available [here <span style="color: #097BC1"> `r fontawesome::fa("github")`</span>](https://github.com/hassothea/AggregationMethods). To run the codes, you can <span style="color: #097BC1">`clone`</span> the repository directly or simply load the <span style="color: #097BC1">`R script`</span> source file from the repository using [devtools](https://cran.r-project.org/web/packages/devtools/index.html) package in <span style="color: #0287D8;"> **Rstudio** </span> as follow:

1. Install [devtools](https://cran.r-project.org/web/packages/devtools/index.html) package using command: 

    `install.packages("devtools")`

2. Loading the source codes from <span style="color: #097BC1">GitHub `r fontawesome::fa("github")`</span> repository using `source_url` function by: 

    `devtools::source_url("https://raw.githubusercontent.com/hassothea/AggregationMethods/main/KernelAggReg.R")`

---

> **&#9998; Note**: All codes contained in this `Rmarkdown` are built with recent version of <span style="color: #097BC1">`r fontawesome::fa("r-project")`</span> (version $>$ 4.1, available [here](https://cran.r-project.org/bin/windows/base/)) and <span style="color: #0287D8;"> **Rstudio** </span> (version > `2022.02.2+485`, available [here](https://www.rstudio.com/products/rstudio/download/#download)). Note also that the code chucks are <span style="color: #E6180A;">hidden</span> by default.

<span style="color: #F0AE14"> **To see the codes, you can:** </span>

- click on the top-right <span style="color: #54D319 ;">`Code`</span> button of the page, then choose **Show All Code** to show all the codes, or 
- simply click on the right-corner <span style="color: #54D319 ;">`Code`</span> button at each section to show the codes of that specific section.

---

<span style="color: #1FAAE3;"><u>KFC procedure & important packages </u></span>
===

<span style="color: #F0AE14;"><u>KFC procedure</u></span>
---

KFC procedure is a three-step methodology which puts together clustering and consensual aggregation methods to construct predictions in supervised learning problems. The three steps of the procedure are:

- **Step *K* **: $K$-means clustering algorithm is implemented on the input data using several options of Bregman divergences $({\cal B}_j)_{j=1}^M$ ($M$ is the number of total divergences used), therefore, the input data is partitioned into many different structures, according to the Bregman divergence used.
- **Step *F* **: For a partition structure given by a divergence ${\cal B}_j$, we fit simple models (linear, for example) on all the clusters of the obtained partition. Then, the collection ${\cal M}_j=\{{\cal M}_{j,k}\}_{k=1}^K$ of these local models is called *candidate* model, corresponding to the Bregman divergence ${\cal B}_j$. At the end of this step, several candidate models are constructed. 
- **Step *C* **: This step aggregates the obtained candidate models using consensual aggregation methods studied in [Has (2021)](https://hal.archives-ouvertes.fr/hal-02884333v5) or [Fischer and Mougeot (2019)](https://www.sciencedirect.com/science/article/pii/S0378375818302349).


---

>**Remark.1**: 
The prediction of any observation $x$ given by a candidate model ${\cal M}_j$ is done in two simple steps:

1. $x$ is classified into one of the obtained clusters using the corresponding divergence ${\cal B}_j$, i.e.,
  $$x\in{\cal C}_{k^*} \Leftrightarrow {\cal B}_j(c_{k^*},x)=\inf_{1\leq k\leq K}{\cal B}_j(c_k,x)$$
where $\{c_1,...,c_K\}_{k=1}^K$ are the centroids of the corresponding clusters $\{C_1,...,C_K\}$.

2. The prediction of $x$ is given by the corresponding local model ${\cal M}_{j,k^*}$ defined on cluster $k^*$, i.e., ${\cal M}_j(x)={\cal M}_{j,k^*}(x)$.

---

![The figure above represents the summary of KFC procedure](./kfc.png)

<span style="color: #F0AE14;"><u>Important packages</u></span>
---

We prepare all the necessary tools for this `Rmarkdown`. `pacman` package allows us to load (if exists) or install (if does not exist) any available packages from [The Comprehensive R Archive Network (CRAN)](https://cran.r-project.org/) of <span style="color: #097BC1">`r fontawesome::fa("r-project")`</span>. 


```{r}
# Check if package "pacman" is already installed 

lookup_packages <- installed.packages()[,1]
if(!("pacman" %in% lookup_packages))
  install.packages("pacman")


# To be installed or loaded
pacman::p_load(magrittr)
pacman::p_load(tidyverse)

## package for "generateMachines"
pacman::p_load(tree)
pacman::p_load(glmnet)
pacman::p_load(randomForest)
pacman::p_load(FNN)
pacman::p_load(xgboost)
pacman::p_load(keras)
pacman::p_load(pracma)
pacman::p_load(latex2exp)
pacman::p_load(plotly)
pacman::p_load(parallel)
pacman::p_load(foreach)
pacman::p_load(doParallel)
rm(lookup_packages)
```


<span style="color: #1FAAE3;"><u>Bregman divergences (BD)</u></span>
===

<span style="color: #1FAAE3;">**Definition**</span> Let $\phi:\mathcal{C}\rightarrow\mathbb{R}$ be a strictly convex and continuously differentiable function defined on a measurable convex subset $\mathcal{C}\subset\mathbb{R}^d$. Let $int(\mathcal{C})$ denote its relative interior. A Bregman divergence indexed by $\phi$ is a dissimilarity measure $d_{\phi}:\mathcal{C}\times int(\mathcal{C})\rightarrow\mathbb{R}$ defined for any pair $(x,y)\in \mathcal{C}\times int(\mathcal{C})$ by,
\begin{equation}
\label{eq:1.10}
d_{\phi}(x,y)=\phi(x)-\phi(y)-\langle x-y,\nabla\phi(y)\rangle 
\end{equation}
where $\nabla\phi(y)$ denotes the gradient of $\phi$ computed at a point $y\in int(\mathcal{C})$. A Bregman divergence is not necessarily a metric as it may not be symmetric and the triangular inequality might not be satisfied.

This section defines all the Bregman divergences used. The list of all the Bregman divergences is given in the table below:

----

Name                        $\phi$                                        $d_{\phi}$                                                                                                    $\cal C$
------------------------- ----------------------------------------------- ----------------------------------------------------------------------------------- -----------
Euclidean                 ${\|x\|_2^2}=\sum_{i=1}^dx_i^2$                   $\|x-y\|_2^2$                                                                     $\mathbb{R}^d$
General Kullback-Leibler  $\sum_{i=1}^d x_i\ln( x_i)$                       $\sum_{i=1}^d( x_i\ln(\frac{ x_i}{y_i})-(x_i-y_i))$                               $(0,+\infty)^d$                                  
Logistic                  $\sum_{i=1}^d(x_i\ln( x_i)+(1- x_i)\ln(1- x_i))$  $\sum_{i=1}^d\Big( x_i\ln(\frac{x_i}{y_i})+(1- x_i)\ln(\frac{1- x_i}{1-y_i})\Big)$   $(0,1)^d$
Itakura-Saito             $-\sum_{i=1}^d\ln( x_i)$                          $\sum_{i=1}^d\Big(\frac{ x_i}{y_i}-\ln(\frac{ x_i}{y_i})-1\Big)$                    $(0,+\infty)^d$
Exponential               $\sum_{i=1}^de^{x_i}$                             $\sum_{i=1}^d(e^{x_i}-e^{y_i}-e^{y_i}(x_i-y_i))$                                    $\mathbb{R}^d$
Polynomial                $\sum_{i=1}^d|x|^p,p>2$                           $\sum_{k=1}^d(|x_k|^p-|y_k|^p-\text{sign}(y_k)^pp(x_k-y_k)y_k^{p-1})$             $\mathbb{R}^d$

----

<span style="color: #F0AE14;"><u>Look-up list of Bregman divergences</u></span>
----

The codes below provide a look-up list of all the BDs defined in the table above.

```{r}
euclidDiv <- function(X., y., deg = NULL){
    res <- sweep(X., 2, y.)
    return(rowSums(res^2))
}
gklDiv <- function(X., y., deg = NULL){
  res <- c("/", "-") %>%
    map(.f = ~ sweep(X., 2, y., FUN = .x))
  return(rowSums(X.*log(res[[1]]) - res[[2]]))
}
logDiv <- function(X., y., deg = NULL){
    res <-  map2(.x = list(X., 1-X.),
                 .y = list(y., 1-y.),
                 .f = ~ sweep(.x, 2, .y, FUN = "/"))
    return(rowSums(X.*log(res[[1]])+(1-X.)*log(res[[2]])))
}
itaDiv <- function(X., y., deg = NULL){
    res <- sweep(X., 2, y., FUN = "/")
    return(rowSums(res-log(res) - 1))
}
expDiv <- function(X., y., deg = NULL){
    exp_y <- exp(y.)
    res <- sweep(1+X., 2, y.) %>%
      sweep(2, exp_y, FUN = "*")
    return(rowSums(exp(X.)-res))
}
polyDiv <- function(X., y., deg = 3){
    S <- map2(.x = list(X., X.^deg),
              .y = list(y., y.^deg),
              .f = ~ sweep(.x, 
                           MARGIN = 2, 
                           STATS = .y,
                           FUN = "-"))
    if(deg %% 2 == 0){
      Tem <- sweep(S[[1]], 2, y.^(deg-1), FUN = "*")
      res <- rowSums(S[[2]] - deg * Tem)
    }
    else{
      Tem <- sweep(S[[1]], 2, sign(y.) * y.^(deg-1), FUN = "*")
      res <- rowSums(S[[2]] - deg * Tem)
    }
    return(res)
}
lookup_div <- list(
  euclidean = euclidDiv,
  gkl = gklDiv,
  logistic = logDiv,
  itakura = itaDiv,
  exponential = expDiv,
  polynomial = polyDiv
)
```


<span style="color: #F0AE14;"><u>Function</u></span> : `BregmanDiv`
----

This function computes the matrix of BDs between two sets of data points. Each set of data points should be stored as matrices, data frame, or tibble object where each row represents an individual data point.

- **Argument**:

    - `X.`, `C.` : The data matrices, tibbles and data frames, containing the data points for which the Bregman divergences between them are to be computed.
    - `div` : the type of divergence to be used. It should be a subset of `{"euclidean", "gkl", "logistic", "itakura", "exponential", "polynomial"}`.
    - `deg` : the degree of polynomial BD (if one is used).
- **Value**: 
    
    This function returns a *tibble* object $D=(d_{i,j})$ where $d_{i,j}$ is the Bregman divergence between row $i$ of `X.` and row $j$ of `C.`.


```{r}
BregmanDiv <- function(X., 
                       C., 
                       div = c("euclidean",
                                "gkl",
                                "logistic",
                                "itakura",
                                "exponential",
                                "polynomial"),
                       deg = 3){
  div <- match.arg(div)
  d_c <- dim(C.)
  if(is.null(d_c)){
    C <- matrix(C., nrow = 1, byrow = TRUE)
  } else{
    C <- as.matrix(C.)
  }
  if(is.null(dim(X.))){
    X <- matrix(X., nrow = 1, byrow = TRUE)
  } else{
    X <- as.matrix(X.)
  }
  dis <-  map_dfc(.x = 1:dim(C)[1],
                  .f = ~ tibble('{{.x}}' := lookup_div[[div]](X, C[.x,], deg = deg)))
  return(dis)
}
```

<span style="color: #1FAAE3;"><u>Step $K$: $K$-means with Bregman divergences</u></span>
===

This section implements $K$-means algorithm using Bregman divergences which corresponds to the step $K$ of KFC procedure.

<span style="color: #F0AE14;"><u>Function</u></span> : `findClosestCentroid` and `newCentroids`
----

These two functions perform the main steps of $K$-means algorithm. `findClosestCentroid` function assigns any data points to some cluster according to the nearest centroid among all the centroids, and `newCentroids` function compute the new centroids of the partition given the cluster labels of all data points.

- **Argument**:

    - `x.` : The data matrices, tibbles and data frames, containing the data points to be assigned to some cluster.
    - `centroids` : The matrix or data frame containing the centroids (by row).
    - `div` : the type of divergence to be used.
    - `deg` : the degree of polynomial BD (if one is used).
    
- **Value**: 

    The each of the two functions returns arguments for one another:
    
    - `findClosestCentroid` returns a vector of size equals to the number of rows of data matrix `x.`, containing the cluster labels of the data points.
    - `newCentroids` returns the matrix of centroids.

```{r}
findClosestCentroid <- function(x., centroids., div, deg = 3){
  dist <- BregmanDiv(x., centroids., div, deg)
  clust <- 1:nrow(x.) %>%
    map_int(.f = ~ which.min(dist[.x,]))
  return(clust)
}
newCentroids <- function(x., clusters.){
  centroids <- unique(clusters.) %>%
    map_dfr(.f = ~ colMeans(x.[clusters. == .x, ]))
  return(centroids)
}
```

<span style="color: #F0AE14;"><u>Function</u></span> : `kmeansBD`
----

This function performs $K$-means algorithm with BDs. 

- **Argument**:

    - `train_input` : The data matrices, tibbles and data frames, containing the data points.
    - `K` : The number of clusters.
    - `n_start` : the number of times to perform the algorithm, and the best one among them is chosen to be the final result. This is done to avoid local optimal solutions. By default, `n_start = 5`.
    - `maxIter` : the maximum number of iterations in case the algorithm does not converge. By default, `maxIter = 500`.
    - `deg` : the degree of polynomial BD (if one is used).
    - `scale_input` : logical value controlling whether to scale the input to be in $(0,1)$ or not. By default, `scale_input = FALSE`.
    - `div` : the type of divergence to be used. By default, `div = "euclidean"` and the usual $K$-means algorithm is performed.
    - `splits` : real number between $0$ and $1$ specifying the proportion of training data to be used to perform $K$-means algorithm. The remaining part with be used for the aggregation. By default, `splits = 1` and all the input data are used.
    - `epsilon` : the stopping threshold criterion to stop the algorithm. By default, `epsilon = 1e-10`.
    - `center_`, `scale_` : the center and scale to be used to scale the input data. By default, they are `NULL`.
    - `id_shuffle` : a logical vector specifying which part of the training data will be selected to perform the algorithm. This is important when we want to perform the algorithm on the same set of data points but with different BDs. 
    
- **Value**: 

    This function returns a list of the following objects:
    - `centroids` : the matrix of the obtained centroids.
    - `clusters` : a vector of cluster label of the data points.
    - `train_data` : list of the following important objects
        - `X_train` : the training data used for the algorithm.
        - `X_remain` : the remaining part of the input data used for the aggregation.
        - `id_remain` : the logical vector specifying the remaining part (`X_remain`) of the input data.
    - `parameters` : the list of the following objects:
        - `div` : divergence used.
        - `deg` : the degree of polynomial BD (if one is used).
        - `center_`, `scale_` : the center and scale used to scale the input data.
    - `running_time`: the computational time of the algorithm.

```{r}
kmeansBD <- function(train_input,
                     K,
                     n_start = 5,
                     maxIter = 500,
                     deg = 3,
                     scale_input = FALSE,
                     div = "euclidean",
                     splits = 1,
                     epsilon = 1e-10,
                     center_ = NULL,
                     scale_ = NULL,
                     id_shuffle = NULL){
  start_time <- Sys.time()
  # Distortion function
  X <- as.matrix(train_input)
  N <- dim(X)
  if(scale_input){
    if(!(is.null(center_) & is.null(scale_))){
      if(length(center_) == 1){
        center_ <- rep(center_, N[2])
      }
      if(length(scale_) == 1){
        scale_ <- rep(scale_, N[2])
      }
    } else{
      min_ <- apply(X, 2, FUN = min)
      c_ <- abs(colMeans(X)/5)
      center_ <- min_ - c_
      scale_ <- apply(X, 2, FUN = max) - center_ + 1
    }
    X <- scale(X, center = center_, scale = scale_)
  }
  if(is.null(id_shuffle)){
    train_id <- rep(TRUE, N[1])
    if(splits < 1){
      train_id[sample(N[1], floor(N[1]*(1-splits)))] <- FALSE
    }
  } else{
    train_id <- id_shuffle
  }
  X_train1 <- X[train_id,]
  X_train2 <- X[!train_id,]
  mu <- as.matrix(colMeans(X_train1))
  distortion <- function(clus){
    cent <- newCentroids(X_train1, clus)
    var_within <- 1:K %>%
      map(.f = ~ BregmanDiv(X_train1[clus == .x,], 
                            cent[.x,], 
                            div, 
                            deg)) %>%
      map(.f = sum) %>%
      Reduce("+", .)
    return(var_within)
  }
  # Kmeans algorithm
  kmeansWithBD <- function(x., k., maxiter., eps.) {
    n. <- nrow(x.)
    # initialization
    init <- sample(n., k.)
    centroids_old <- x.[init,]
    i <- 0
    while(i < maxIter){
      # Assignment step
      clusters <- findClosestCentroid(x., centroids_old, div, deg)
      # Recompute centroids
      centroids_new <- newCentroids(x., clusters)
      if ((sum(is.na(centroids_new)) > 0) |
          (nrow(centroids_new) != k.)) {
        init <- sample(n., k.)
        centroids_old <- x.[init,]
        warning("NA produced -> reinitialize centroids...!")
      }
      else{
        if(sum(abs(centroids_new - centroids_old)) > eps.){
          centroids_old <- centroids_new
        } else{
          break
        }
      }
      i <- i + 1
    }
    return(clusters)
  }
  results <- 1:n_start %>% 
    map_dfc(.f = ~ tibble("{{.x}}" := kmeansWithBD(X_train1, 
                                                   K,
                                                   maxIter, 
                                                   epsilon)))
  opt_id <- 1:n_start %>%
    map_dbl(.f = ~ distortion(results[[.x]])) %>%
    which.min
  cluster <- clusters <- results[[opt_id]]
  j <- 1
  ID <- unique(cluster)
  for (i in ID) {
    clusters[cluster == i] = j
    j =  j + 1
  }
  centroids = newCentroids(X_train1, clusters)
  time_taken <- Sys.time() - start_time
  return(
    list(
      centroids = centroids,
      clusters = clusters,
      train_data = list(X_train = X_train1,
                        X_remain = X_train2,
                        id_remain = !train_id),
      parameters = list(div = div,
                        deg = deg,
                        center_ = center_,
                        scale_ = scale_),
      running_time = time_taken
    )
  )
}
```

----

> **Example.1**: We perform $K$-means algorithm with `"gkl"` BD on [Abalone](https://archive.ics.uci.edu/ml/machine-learning-databases/abalone) dataset.

----

```{r}
pacman::p_load(readr)
colname <- c("Type", "LongestShell", "Diameter", "Height", "WholeWeight", "ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings")
df <- readr::read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data", col_names = colname, delim = ",", show_col_types = FALSE)
n <- nrow(df)
train <- logical(n)
train[sample(n,  floor(n*0.8))] <- TRUE
cl <- df[train,2:(ncol(df)-1)] %>%
  kmeansBD(K = 3, div = "gkl", splits = 0.5, scale_input = TRUE)
table(cl$clusters)
```


<span style="color: #1FAAE3;"><u>Step $F$: Fitting predictive models</u></span>
===

This section builds global models by fitting local model on each given cluster of the obtained partition. This corresponds to the step $F$ of the procedure.

<span style="color: #F0AE14;"><u>Function</u></span> : `fitLocalModels`
----

This function fits local models on all clusters of the obtained partition.

- **Argument**:

    - `kmeans_BD` : An object obtained from `kmeansBD` function.
    - `train_response` : The vector of response variable corresponding to the full `input_data` given to `kmeanBD` function.
    - `model` : Type of local model to fit on all the clusters of the given partition. It should be either a model object which is compactible
    - `formula` : The degree of polynomial BD (if one is used).

- **Value**:

    This function returns a list of the following objects:
    - `local_models` : all the local models fitted on all clusters of the given partition.
    - `kmeans_BD` : the `kmeansBD` object.
    - `data_remain` : a list of the following object:
        - `fit` : the fitted values of the remaining part of the input data.
        - `response` : the actual response values corresponding to the remaining input data.
    - `running_time` : the computational time of the algorithm.

```{r}
fitLocalModels <- function(kmeans_BD,
                           train_response,
                           model = "lm",
                           formula = NULL){
  start_time <- Sys.time()
  X_train <- kmeans_BD$train_data$X_train
  y_train <- train_response[!(kmeans_BD$train_data$id_remain)]
  X_remain <- kmeans_BD$train_data$X_remain
  y_remain <- NULL
  if(!is.null(X_remain)){
    y_remain <- train_response[kmeans_BD$train_data$id_remain]
  }
  pacman::p_load(tree)
  pacman::p_load(randomForest)
  model_ <- ifelse(model == "tree", tree::tree, model)
  K <- nrow(kmeans_BD$centroids)
  if (is.null(formula)){
    form <- formula(target ~ .)
  }
  else{
    form <- update(formula, target ~ .)
  }
  data_ <- bind_cols(X_train, "target":= y_train)
  fit_lookup <- list(lm = "fitted.values",
                     rf = "predicted")
  if(is.character(model_)){
    model_lookup <- list(lm = lm,
                         rf = randomForest::randomForest)
    mod <- map(.x = 1:K, 
               .f = ~ model_lookup[[model_]](formula = form, 
                                             data = data_[kmeans_BD$clusters == .x, ]))
  } else{
    mod <- map(.x = 1:K, 
               .f = ~ model_(formula = form, 
                             data = data_[kmeans_BD$clusters == .x,]))
  }
  pred0 <- NULL
  if(!is.null(X_remain)){
    pred0 <- vector(mode = "numeric", 
                    length = length(y_remain))
    clus <- findClosestCentroid(x. = X_remain,
                                centroids. = kmeans_BD$centroids,
                                div = kmeans_BD$parameters$div,
                                deg = kmeans_BD$parameters$deg)
    for(i_ in 1:K){
      pred0[clus == i_] <- predict(mod[[i_]],
                                   as.data.frame(X_remain[clus == i_,]))
    }
  }
  time_taken <- Sys.time() - start_time
  return(list(
    local_models = mod,
    kmeans_BD = kmeans_BD,
    data_remain = list(fit = pred0,
                       response = y_remain),
    running_time = time_taken
  ))
}
```

---- 

> **Example.2**: From **Example.1** above, multiple linear regression models are built on all the obtained clusters. The mean square error of this model, evaluated on the remaining $50\%$ of the training data is computed.

----

```{r}
fit <- fitLocalModels(train_response = df$Rings[train],
                      kmeans_BD = cl,
                      model = "lm")

mean((fit$data_remain$response- fit$data_remain$fit)^2)
```

<span style="color: #F0AE14;"><u>Function</u></span> : `localPredict`
----

This function allows us to predict any new observation using the candidate model $\cal M_j=({\cal M}_{j,k})_{k=1}^M$ corresponding to Bregman divergence ${\cal B}_j$, for some $j\in J\subset\{1,...,M\}$. 

- **Argument**:

    - `localModels` : The local model object obtained from `fitLocalModels` function.
    - `newData` : New input data to be predicted using the candidate models given in `localModels` argument.
    
- **Value**:

    This function returns a predicted vector of the `newData`.

```{r}
localPredict <- function(localModels,
                         newData){
  kmean_BD <- localModels$kmeans_BD
  K <- nrow(kmean_BD$centroids)
  newData_ <- newData
  if(!(is.null(kmean_BD$parameters$center_))){
    newData_ <- scale(newData,
                      center = kmean_BD$parameters$center_,
                      scale = kmean_BD$parameters$scale_)
    id0 <- (newData_ <= 0)
    if(sum(id0) > 0){
      min_ <- min(newData_[id0])
      newData_[id0] <- runif(sum(id0), min(1e-3, min_/10), min_)
    }
  }
  clus <- findClosestCentroid(x. = newData_,
                              centroids. = kmean_BD$centroids,
                              div = kmean_BD$parameters$div,
                              deg = kmean_BD$parameters$deg)
  pred0 <- vector(mode = "numeric", length = nrow(newData_))
  for(i_ in 1:K){
    pred0[clus == i_] <- predict(localModels$local_models[[i_]],
                                 as.data.frame(newData_[clus == i_,]))
  }
  pred0 <- as_tibble(pred0)
  names(pred0) <- ifelse(kmean_BD$parameters$div == "polynomial",
                         paste0("polynomial", kmean_BD$parameters$deg),
                         kmean_BD$parameters$div)
  return(pred0)
}
```

----

> **Example.3**: The the performance of the candidate model corresponding to `"gkl"` divergence is compared to random forest regression on a $20\%$ testing data.

----

```{r}
y_hat <- localPredict(fit,
                      df[!train, 2:(ncol(df)-1),])
rf <- randomForest(Rings ~ ., data = df[train,2:ncol(df)])
mean((predict(rf, newdata = df[!train,2:ncol(df)])- df$Rings[!train])^2)
mean((y_hat$gkl-df$Rings[!train])^2)
```


<span style="color: #1FAAE3;"><u>Step $C$: Combining methods</u></span>
===

The aggregation methods are available [here <span style="color: #097BC1"> `r fontawesome::fa("github")`</span>](https://github.com/hassothea/AggregationMethods). The aggregation methods are imported into <span style="color: #0287D8;"> **Rstudio** </span> environment.

```{r, warning=FALSE}
pacman::p_load(devtools)
### Kernel based consensual aggregation
source_url("https://raw.githubusercontent.com/hassothea/AggregationMethods/main/KernelAggReg.R")
### MixCobra
source_url("https://raw.githubusercontent.com/hassothea/AggregationMethods/main/MixCobraReg.R")
```

<span style="color: #F0AE14;"><u>Function</u></span> : `stepK`, `stepF` and `stepC`
----

These functions allow to set the values of the parameters in the three steps of the KFC procedure. Each function returns a list of all the parameters given in its arguments.

```{r}
stepK = function(K,
                 n_start = 5,
                 maxIter = 300,
                 deg = 3,
                 scale_input = FALSE,
                 div = NULL,
                 splits = 0.75,
                 epsilon = 1e-10,
                 center_ = NULL,
                 scale_ = NULL){
  return(list(K = K,
              n_start = n_start,
              maxIter = maxIter,
              deg = deg,
              scale_input = scale_input,
              div = div,
              splits = splits,
              epsilon = epsilon,
              center_ = center_ ,
              scale_ = scale_))
}

stepF = function(formula = NULL, 
                 model = "lm"){
  return(list(formula = formula, 
              model = model))
}

stepC = function(n_cv = 5,
                 method = c("cobra", "mixcobra"),
                 opt_methods = c("grad", "grid"),
                 kernels = "gaussian",
                 scale_features = FALSE){
  return(list(n_cv = n_cv,
              method = method,
              opt_methods = opt_methods,
              kernels = kernels,
              scale_features = scale_features))
}
```


<span style="color: #1FAAE3;"><u>Function</u></span>: `KFCreg`
===

This function is the complete implementation of KFC procedure.

- **Argument**:

    - `train_input` : The matrix or data frame of training input data.
    - `train_response` : The training response variable.
    - `test_input`: The testing input data.
    - `test_response` : The response variable of the testing data. It is optional. If it is not `NULL`, the mean square error (mse) is computed.
    - `n_cv` : The number of folds in cross-validation.
    - `parallel` : A logical value specifying whether or not to perform the step $K$ in parallel. By default, `parallel = TRUE`, and note that internet connection is required in this case.
    - `inv_sigma`, `alp` : the inverse normalized constant $\sigma^{-1}>0$ and the exponent $\alpha>0$ of exponential kernel: $K(x)=e^{-\|x/\sigma\|^{\alpha}}$ for any $x\in\mathbb{R}^d$. By default, `inv_sigma = `$\sqrt{1/2}$ and `alpha = 2` which corresponds to the Gaussian kernel.
    - `K_step` : The object obtained from function `stepK` which allowing to set the values of the paramters in step $K$ of the procedure.
    - `F_step` : The object obtained from function `stepF` which allowing to set the values of the paramters in step $F$ of the procedure.
    - `C_step` : The object obtained from function `stepC` which allowing to set the values of the paramters in step $C$ of the procedure.
    - `setGradParamAgg` : The object from the `setGradParameter` function, allowing to set the values of the parameters of **gradient descent** algorithm for the <span style="color: #E6180A;">**1st aggregation method$^1$**</span>.
    - `setGridParamAgg` : The object from the `setGridParameter` function, allowing to set the values of the parameters of the **grid search** algorithm for the <span style="color: #E6180A;">**1st aggregation method$^1$**</span>.
    - `setGradParamMix` : The object from the `setGradParameter_Mix` function, allowing to set the values of the parameters of the **gradient descent** algorithm for the <span style="color: #0832CD">**2nd aggregation method$^2$**</span>.
    - `setGridParamMix` : The object from the `setGridParameter_Mix` function, allowing to set the values of the parameters of the **grid search** algorithm for the <span style="color: #0832CD">**2nd aggregation method$^2$**</span>.
    
- **Value**:

    This function returns a list of the following objects:
    
    - `predict_final` : the final predictions given by the aggregation of all the candidate models.
    - `predict_local` : the predictions given by all the individual candidate models.
    - `agg_method` : the list of aggregation methods obtained in the step $C$ of the procedure.
    - `running_time` : the computational time of the algorithm.

----

  > **Remark.2**: The `parallel` argument above requires internet connection to load the source codes of $K$-means algorithm with BDs from [GitHub <span style="color: #097BC1"> `r fontawesome::fa("github")`</span>](https://github.com/hassothea/KFC-Procedure/blob/master/kmeanBD.R). It is performed on the maximum number of clusters of your machine, and the speed is at least two times faster than without parallelism, however, it is not so stable depending on your internet connection or machine. About the aggregation methods of step $C$,


- <span style="color: #E6180A;">$^1$</span> is the kernel-based consensual aggregation for regression by [Has (2021)](https://hal.archives-ouvertes.fr/hal-02884333v5). The source codes of these functions are available in [KernelAggReg.R](https://github.com/hassothea/AggregationMethods/blob/main/KernelAggReg.R), and the documentation is available [here](https://hassothea.github.io/files/KernelAggReg/KernelAggReg.html).
- <span style="color: #0832CD">$^2$</span> is the aggregation using input-output trade-off by [Fischer and Mougeot (2019)](https://www.sciencedirect.com/science/article/pii/S0378375818302349). The source codes of these functions are available in [MixCobra.R](https://github.com/hassothea/AggregationMethods/blob/main/MixCobraReg.R), and the documentation is available on [here](https://hassothea.github.io/files/KernelAggReg/MixCobraReg.html).

----


```{r}
KFCreg = function(train_input,
                  train_response,
                  test_input,
                  test_response = NULL,
                  n_cv = 5,
                  parallel = TRUE,
                  inv_sigma = sqrt(.5),
                  alp = 2,
                  K_step = stepK(splits = .5),
                  F_step = stepF(),
                  C_step = stepC(),
                  setGradParamAgg = setGradParameter(),
                  setGridParamAgg = setGridParameter(),
                  setGradParamMix = setGradParameter_Mix(),
                  setGridParamMix = setGridParameter_Mix()){
  start_time <- Sys.time()
  lookup_div_names <- c("euclidean",
                         "gkl",
                         "logistic",
                         "itakura",
                         "polynomial")
  div_ <- K_step$div
  ### K step: Kmeans clustering with BDs
  if (is.null(K_step$div)){
    divergences <- lookup_div_names
    warning("No divergence provided! All of them are used!")
  }
  else{
    divergences <- K_step$div %>% 
      map_chr(.f = ~ match.arg(arg = .x, 
                               choices = lookup_div_names))
  }
  div_list <- divergences %>% 
    map(.f = (\(x) if(x != "polynomial") return(x) else return(rep("polynomial", length(K_step$deg))))) %>%
    unlist
  deg_list <- rep(NA, length(div_))
  deg_list[div_list == "polynomial"] <- K_step$deg
  div_names <- map2_chr(.x = div_list,
                        .y = deg_list,
                        .f = (\(x, y) if(is.na(y)) return(x) else return(paste0(x,y))))
  ### Step K: Kmeans clustering with Bregman divergences
  dm <- dim(train_input)
  id_shuffle <- vector(length = dm[1])
  n_train <- floor(K_step$splits * dm[1])
  id_shuffle[sample(dm[1], n_train)] <- TRUE
  if(parallel){
    numCores <- parallel::detectCores()
    doParallel::registerDoParallel(numCores) # use multicore, set to the number of our cores
    kmean_ <- foreach(i=1:length(div_names)) %dopar% {
      devtools::source_url("https://raw.githubusercontent.com/hassothea/KFC-Procedure/master/kmeanBD.R")
      kmeansBD(train_input = train_input,
               K = K_step$K,
               div = div_list[i],
               n_start = K_step$n_start,
               maxIter = K_step$maxIter,
               deg = deg_list[i],
               scale_input = K_step$scale_input,
               splits = K_step$splits,
               epsilon = K_step$epsilon,
               center_ = K_step$center_,
               scale_ = K_step$scale_,
               id_shuffle = id_shuffle)
    }
    doParallel::stopImplicitCluster()
  } else{
    kmean_ <- map2(.x = div_list,
                   .y = deg_list,
                   .f = ~ kmeansBD(train_input = train_input,
                                   K = K_step$K,
                                   div = .x,
                                   n_start = K_step$n_start,
                                   maxIter = K_step$maxIter,
                                   deg = .y,
                                   scale_input = K_step$scale_input,
                                   splits = K_step$splits,
                                   epsilon = K_step$epsilon,
                                   center_ = K_step$center_,
                                   scale_ = K_step$scale_,
                                   id_shuffle = id_shuffle))
  }
  names(kmean_) <- div_names
  ### F step: Fitting the corresponding model on each observed cluster
  model_ <- div_names %>%
    map(.f = ~ fitLocalModels(kmean_[[.x]],
                              train_response = train_response,
                              model = F_step$model,
                              formula = F_step$formula))
  names(model_) <- div_names
  pred_combine <- model_ %>%
    map_dfc(.f = ~ .x$data_remain$fit)
  y_remain <- train_response[!id_shuffle]
  pred_test <- div_names %>%
    map_dfc(.f = ~ localPredict(model_[[.x]],
                                test_input))
  names(pred_test) <- names(pred_combine) <- div_names
  # C step: Consensual regression aggregation method with kernel-based COBRA
  list_method_agg <- list(mixcobra = function(pred){MixCobraReg(train_input = train_input[!id_shuffle,],
                                                                train_response = y_remain,
                                                                test_input = test_input,
                                                                train_predictions = pred,
                                                                test_predictions = pred_test,
                                                                test_response = test_response,
                                                                scale_input = K_step$scale_input,
                                                                scale_machine = C_step$scale_features,
                                                                n_cv = C_step$n_cv,
                                                                inv_sigma = inv_sigma,
                                                                alp = alp,
                                                                kernels = C_step$kernels,
                                                                optimizeMethod = C_step$opt_methods,
                                                                setGradParam = setGradParamMix,
                                                                setGridParam = setGridParamMix)},
                          cobra = function(pred){kernelAggReg(train_design = pred,
                                                              train_response = y_remain,
                                                              test_design = pred_test,
                                                              test_response = test_response,
                                                              scale_input = K_step$scale_input,
                                                              scale_machine = C_step$scale_features,
                                                              build_machine = FALSE,
                                                              machines = NULL,
                                                              n_cv = C_step$n_cv,
                                                              inv_sigma = sqrt(.5),
                                                              alp = 2,
                                                              kernels = C_step$kernels,
                                                              optimizeMethod = C_step$opt_methods,
                                                              setGradParam = setGradParamAgg,
                                                              setGridParam = setGridParamAgg)})
  res <- map(.x = C_step$method,
             .f = ~ list_method_agg[[.x]](pred_combine))
  list_agg_methods <- list(cobra = "cob",
                           mixcobra = "mix")
  names(res) <- C_step$method
  ext_fun <- function(L, nam){
    tab <- L$fitted_aggregate
    names(tab) <- paste0(names(tab), "_", nam)
    return(tab)
  }
  pred_fin <- C_step$method %>%
    map_dfc(.f = ~ ext_fun(res[[.x]], list_agg_methods[[.x]]))
  time.taken <- Sys.time() - start_time
  ### To finish
  if(is.null(test_response)){
    return(list(
    predict_final = pred_fin,
    predict_local = pred_test,
    agg_method = res,
    running_time = time.taken
  ))
  } else{
    error <- cbind(pred_test, pred_fin) %>%
      dplyr::mutate(y_test = test_response) %>%
      dplyr::summarise_all(.funs = ~ (. - y_test)) %>%
      dplyr::select(-y_test) %>%
      dplyr::summarise_all(.funs = ~ mean(.^2))
    return(list(
      predict_final = pred_fin,
      predict_local = pred_test,
      agg_method = res,
      mse = error,
      running_time = time.taken
  ))
  }
}
```

> **Example.4**: A complete KFC procedure is implemented on the same Abalone data, using $6$ BDs `"euclidean"`, `"itakura"`, `"gkl"`, `"logistic"` and `"polynomial"` (of degree $3$ and $6$). Both aggregation methods are used in the step $C$. Two kernel functions are used for each aggregation method: `"gaussian"` (with gradient descent algorithm) and `"epanechnikov"` (with grid search algorithm).

```{r}
train1 <- logical(n)
train1[sample(n,  floor(n*0.8))] <- TRUE
kfc1 <- KFCreg(train_input = df[train1,2:ncol(df)],
                train_response = df$Rings[train1],
                test_input = df[!train1,2:ncol(df)],
                K_step = stepK(K = 3,
                               scale_input = TRUE,
                               div = c("eucl", "ita", "gkl", "log" ,"poly"),
                               deg = c(3, 6),
                               splits = .5),
                C_step = stepC(method = c("cobra", "mixcobra"),
                               opt_methods = c("grad", "grid"),
                               kernels = c("gaussian", "gaussian"),
                               scale_features = FALSE),
                setGradParamAgg = setGradParameter(rate = 1),
                setGridParamAgg = setGridParameter(min_val = .00001,
                                                   max_val = 10,
                                                   n_val = 100),
                setGradParamMix = setGradParameter_Mix(rate = c(1,1)),
                setGridParamMix = setGridParameter_Mix(min_alpha = 1e-10,
                                                       max_alpha = 5,
                                                       min_beta = 1e-10,
                                                       max_beta = 10,
                                                       n_alpha = 20,
                                                       n_beta = 20))
```

> The mean square errors evaluated on $20\%$-testing data of the above computation are reported below.

```{r}
rf1 <- randomForest::randomForest(Rings ~ ., data = df[train1,2:ncol(df)])
kfc1$predict_final %>%
  mutate(rf = predict(rf1, newdata = df[!train1,2:ncol(df)])) %>%
  sweep(MARGIN = 1, STATS = df$Rings[!train1], FUN = "-") %>%
  .^2  %>%
  colMeans
```

---

> <span style="color: #1FAAE3;">&#128214; Read also [KernelAggReg](\files\KernelAggReg.html) and [MixCobraReg](\files\MixCobraReg.html)</span>.

---

- [Has et al. (2021)](https://www.tandfonline.com/eprint/YKGS8GTKDBKYFXEGFWSB/full?target=10.1080/00949655.2021.1891539)
- [Fischer and Mougeot (2019)](https://www.sciencedirect.com/science/article/pii/S0378375818302349)
- [Has (2021)](https://hal.archives-ouvertes.fr/hal-02884333v5)
- [Biau et al. (2016)](https://www.sciencedirect.com/science/article/pii/S0047259X15000950)
- ...
- [dplyr videos](https://www.youtube.com/hashtag/dplyr) `r fontawesome::fa("video")`
- [ggplot2 video tutorial](https://www.youtube.com/hashtag/ggplot2) `r fontawesome::fa("video")`
- [R for data science](https://r4ds.had.co.nz/)

---